Introduction to Generalized Linear Models

ESA and Physalia Collaboration
4th March 2024

Dr Philip Leftwich

Welcome!

physalia logo

About me

Associate Professor in Data Science and Genetics at the University of East Anglia.


Academic background in … .


Social media at .

UEA logo

What to expect during this workshop

These workshop will run for 2 hours each.

  • Combines slides, live coding examples, and exercises for you to participate in.

  • Ask questions in the chat throughout!

What to expect during this workshop


I hope you end up with more questions than answers after this workshop!


Schitts Creek questions gif

Source: giphy.com

What you need for this workshop

  • You are comfortable with simple operations in R.

  • You know how to perform linear regression.

Workshop resources

Data

Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.

Data

How to download:

covid <- readr::read_csv(
  "https://raw.githubusercontent.com/nrennie/f4sg-gams/main/data/covid.csv"
  )


  • See data/ folder on GitHub for pre-processing.

General linear models

A recap on ordinary least squares

Model assumptions

The equation of a linear model (lm) is given by:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where:

  • \(y_i\) is the predicted value of the response variable.

  • \(\beta_0\) is the intercept.

  • \(\beta_1\) is the slope coefficient, representing the change in the response variable for a one-unit change in the explanatory variable.

  • \(x_i\) is the value of the explanatory variable.

  • \(\epsilon_i\) represents the residuals of the model, which are assumed to be drawn from a normal distribution with mean zero and a constant variance. This implies that the residuals, or the distances between the observed values and the values predicted by the linear model, can be modeled by drawing random values from a normal distribution.

Normal distribution of the residuals

Another way to write the lm equation is:

\[ y_i \sim N(\mu = \beta_0 + \beta_1 X_i, \sigma^2) \]

Which literally means that \[y_i\] is drawn from a normal distribution with parameters \[\mu\] (which depends on \[x_i\]) and \[\sigma\] (which has the same value for all \[Y\]s).

What are generalised linear models?

Put janka ls here with log and Gamma

Theoretical foundations

Probability distributions and exponential families

Estimation and inference

Assessing model fit

Likelihood

AIC

Is a generalized linear model always best?

janka_ls <- lm(hardness ~ dens, data = janka)

summary(janka_ls)

Call:
lm(formula = hardness ~ dens, data = janka)

Residuals:
    Min      1Q  Median      3Q     Max 
-338.40  -96.98  -15.71   92.71  625.06 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1160.500    108.580  -10.69 2.07e-12 ***
dens           57.507      2.279   25.24  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 183.1 on 34 degrees of freedom
Multiple R-squared:  0.9493,    Adjusted R-squared:  0.9478 
F-statistic:   637 on 1 and 34 DF,  p-value: < 2.2e-16

Model diagnostics

plot(janka_ls, which=2)
plot(janka_ls, which=3)

Violations of assumptions

library(lmtest)

# Breusch-Pagan Test of Homoscedasticity
bptest(janka_ls)

    studentized Breusch-Pagan test

data:  janka_ls
BP = 4.4668, df = 1, p-value = 0.03456
# Shapiro-Wilk test for normality of residuals
shapiro.test(residuals(janka_ls))

    Shapiro-Wilk normality test

data:  residuals(janka_ls)
W = 0.93641, p-value = 0.03933

Troublesome transformations

See screenshots

janka_sqrt <- lm(sqrt(hardness) ~ dens, data = janka)

plot(janka_sqrt, which=2)
plot(janka_sqrt, which=3)

janka_log <- lm(log(hardness) ~ dens, data = janka)

plot(janka_log, which=2)
plot(janka_log, which=3)

Polynomials

janka_poly <- lm(log(hardness) ~ poly(dens, 2), data = janka)

plot(janka_poly, which=2)
plot(janka_poly, which=3)

Summary polynomial

summary(janka_poly)

Call:
lm(formula = log(hardness) ~ poly(dens, 2), data = janka)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22331 -0.05708 -0.01104  0.07500  0.18871 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.1362     0.0168 424.896  < 2e-16 ***
poly(dens, 2)1   3.3862     0.1008  33.603  < 2e-16 ***
poly(dens, 2)2  -0.5395     0.1008  -5.354 6.49e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1008 on 33 degrees of freedom
Multiple R-squared:  0.9723,    Adjusted R-squared:  0.9706 
F-statistic: 578.9 on 2 and 33 DF,  p-value: < 2.2e-16

ggplot(janka, aes(x = hardness, y = log(dens))) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = (y ~ x)) +  # regression line
  labs(title = "Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title
ggplot(janka, aes(x = hardness, y = log(dens))) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = (y ~ poly(x, 2))) +  # regression line
  labs(title = "Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title

Box-cox

Weighted least squares

weights = 1/sqrt(hardness): This argument specifies the weights to be used in the fitting process. In this case, the weights are set to be inversely proportional to the square root of hardness.

1/sqrt(hardness): This means that observations with higher values of hardness will have lower weights, and observations with lower values of hardness will have higher weights. This could be used to give more importance to certain observations in the model fitting process. data = janka: This specifies the data frame (janka) from which the variables (sqrt(hardness) and dens) are to be taken.

Putting it all together, the code janka_wls <- lm(sqrt(hardness) ~ dens, weights = 1/sqrt(hardness), data = janka) fits a linear model where the response variable is the square root of hardness, the predictor variable is dens, and weights are inversely proportional to the square root of hardness. This is often referred to as a weighted least squares (WLS) regression model, where observations are weighted differently based on certain criteria (in this case, the square root of hardness).

janka_wls <- lm(sqrt(hardness) ~ dens, weights = 1/sqrt(hardness), data = janka)

plot(janka_wls, which=2)
plot(janka_wls, which=3)

prediction_data <- data.frame(dens = sort(unique(janka$dens)))
predictions <- predict(janka_wls, newdata = prediction_data, interval = "confidence", level = 0.95)

# Adding predictions and confidence intervals to the dataframe
prediction_data$wls_pred = predictions[, "fit"]
prediction_data$wls_pred.lwr = predictions[, "lwr"]
prediction_data$wls_pred.upr = predictions[, "upr"]

ggplot(janka) +
     geom_ribbon(data = prediction_data, aes(x = dens, ymin = wls_pred.lwr, ymax = wls_pred.upr), alpha = 0.8, fill = "lightgray")+
    geom_line(data = prediction_data, aes(x = dens, y = wls_pred), color = "blue")+
  geom_point(aes(x = dens, y = sqrt(hardness)))

Generalized approach

janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "sqrt"))

summary(janka_glm)

Call:
glm(formula = hardness ~ dens, family = gaussian(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.60869    1.47660   1.767   0.0863 .  
dens         0.75194    0.02722  27.622   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 25498.81)

    Null deviance: 22485041  on 35  degrees of freedom
Residual deviance:   866960  on 34  degrees of freedom
AIC: 471.38

Number of Fisher Scoring iterations: 4

Gamma distribution

janka_gamma <- glm(hardness ~ dens, data = janka, family = Gamma(link = "sqrt"))

summary(janka_gamma)

Call:
glm(formula = hardness ~ dens, family = Gamma(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.86729    0.90357   2.067   0.0465 *  
dens         0.76780    0.02243  34.224   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.009716208)

    Null deviance: 11.26788  on 35  degrees of freedom
Residual deviance:  0.32876  on 34  degrees of freedom
AIC: 452.97

Number of Fisher Scoring iterations: 4

Plot model

ggplot(janka, aes(x = dens, y = hardness)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "glm", method.args = list(Gamma(link = "sqrt"))) +  # regression line
  labs(title = "Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title

Compare

AIC(janka_ls)
AIC(janka_sqrt)
AIC(janka_wls)
AIC(janka_log)
AIC(janka_glm)
AIC(janka_poly)
library(MuMIn)

# r squared
[1] 481.2123
[1] 150.18
[1] 146.4216
[1] -37.69338
[1] 471.3758
[1] -58.20239

GLMs for binary data

Common response variable in ecological datasets is the binary variable: we observe a phenomenon X or its “absence”

  • Presence/Absence of a species

  • Presence/Absence of a disease

  • Success/Failure to observe behaviour

  • Survival/Death of organisms

Wish to determine if \(P/A∼ Variable\)

Called a logistic regression or logit model

Binary variables

Bernoulli Distribution

Expected values from lm

Probability distribution

:::: columns

$E(Y) = p $

$ y_i = p * (1-p) $

Mean of distribution Probability (p) of observing an outcome

Variance of observed responses As observed responses approach 0 or 1, the variance of the distribution decreases

GLM function

Show odd and transformed odds

Binomial distribution

$ logit(p) = $

$ y_i = Binomial(n,p) $

  • It is the collection of Bernoulli trials for the same event

  • It is represented by the number of Bernoulli trials \(n\)

  • It is also the probability of an event in each trial \(p\)

$ P(X = s) = C(n, s) p^s (1 - p)^{n - s}$

Where:

  • \(n\) is the total number of trials of an event.
  • \(s\) corresponds to the number of times an event should occur.
  • \(p\) is the probability that the event will occur.
  • \((1 - p)\) is the probability that the event will not occur.
  • \(C\) term represents combinations, calculated as \(C(n, s) = \frac{n!}{s!(n - s)!}\), representing the number of ways to choose \(s\) successes out of \(n\) trials.

Example

:::: columns

# According to the situation:
# n = 5,
# s = 5,
# p = 0.95,
# (1 — p) = 0.05

p = 1 * (0.95)^(5) * (0.05)^(0)

p
[1] 0.7737809
[1] 0.7737809

Binomial VS Bernoulli Keypoints!

Bernoulli deals with the outcome of the single trial of the event, whereas Binomial deals with the outcome of the multiple trials of the single event.

Bernoulli is used when the outcome of an event is required for only one time, whereas the Binomial is used when the outcome of an event is required multiple times.

Weights

Overdispersion

Binomial \(P(X = s) = C(n, s) \cdot p^s \cdot (1 - p)^{n - s}\)

Quasibinomial \(P(X = s) = C(n, s) \cdot p(p+k\theta)^{s-1} \cdot (1 - pk\theta)^{n - s}\)

Where:

  • \(n\) is the total number of trials of an event.

  • \(s\) corresponds to the number of times an event should occur.

  • \(p\) is the probability that the event will occur.

  • \((1 - p)\) is the probability that the event will not occur.

  • \(C\) term represents combinations, calculated as \(C(n, s) = \frac{n!}{s!(n - s)!}\), representing the number of ways to choose \(s\) successes out of \(n\) trials.

  • \(\theta\) term describes additional variance outside of the Binomial distribution

GLMs for count data

Poisson

Count or rate data are ubiquitous in the life sciences (e.g number of parasites per microlitre of blood, number of species counted in a particular area). These type of data are discrete and non-negative.

A useful distribution to model abundance data is the Poisson distribution:

a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution.

GLM

Recall the simple linear regression case (i.e a GLM with a Gaussian error structure and identity link). For the sake of clarity let’s consider a single explanatory variable where \(\mu\) is the mean for Y:

\(\mu_i = \beta0 + \beta_1x1 + \beta_2x2 + ... +\beta_nxn\)

\(y_i \sim \mathcal{N}(\mu_i, \sigma^2)\)

  • The mean function is unconstrained, i.e the value of \(\beta_0 + \beta_1X\) can range from \(-\infty\) to \(+\infty\).

  • If we want to model count data we therefore want to constrain this mean to be positive only.

  • Mathematically we can do this by taking the logarithm of the mean (the log is the default link for the Poisson distribution).

  • We then assume our count data variance to be Poisson distributed (a discrete, non-negative distribution), to obtain our Poisson regression model (to be consistent with the statistics literature we will rename \(\mu\) to \(\lambda\)):

GLM Poisson

Poisson equation

Poisson limitations

The Poisson distribution has the following characteristics:

  • Discrete variable, defined on the range \(0, 1, \dots, \infty\).

  • A single rate parameter \(\lambda\), where \(\lambda > 0\).

  • Mean = \(\lambda\)

  • Variance = \(\lambda\)

Poisson: Case study

Cuckoo

In a study by Kilner et al. (1999), the authors studied the begging rate of nestlings in relation to total mass of the brood of reed warbler chicks and cuckoo chicks. The question of interest is:

How does nestling mass affect begging rates between the different species?

This model is inadequate. It is predicting negative begging calls within the range of the observed data, which clearly does not make any sense.

cuckoo_lm <- lm(Beg ~ Mass + Species + Mass:Species, data = cuckoo)

Diagnostics

Let us display the model diagnostics plots for this linear model.

  • Curvature

  • Funnelling effect

Biological data and distributions

fitdistrplus?

Poisson model

We should therefore try a different model structure.

The response variable in this case is a classic count data: discrete and bounded below by zero (i.e we cannot have negative counts). We will therefore try a Poisson model using the canonical log link function for the mean:

  • λ varies with x (mass) which means residual variance will also vary with x, which means that we just relaxed the homogeneity of variance assumption!

  • Predicted values will now be integers instead of fractions

  • The model will never predict negative values (Poisson is strictly positive)

\[ \log{\lambda} = \beta_0 + \beta_1 M_i + \beta_2 S_i + \beta_3 M_i S_i \]

where \(M_i\) is nestling mass and \(S_i\) a dummy variable

\(S_i = \left\{\begin{array}{ll}1 & \text{if } i \text{ is warbler},\\0 & \text{otherwise}.\end{array}\right.\)

The term \(M_iS_i\) is an interaction term. Think of this as an additional explanatory variable in our model. Effectively it lets us have different slopes for different species (without an interaction term we assume that both species have the same slope for the relationship between begging rate and mass, and only the intercept differ).

Regression lines

The mean regression lines for the two species look like this:

  • Cuckoo (\(S_i=0\))

  • \(\begin{aligned}\log{\lambda} & = \beta_0 + \beta_1 M_i + (\beta_2 \times 0) + (\beta_3 \times M_i \times 0)\\\log{\lambda} & = \beta_0 + \beta_1 M_i\end{aligned}\)

  • Warbler (\(S_i=1\))

  • \(\begin{aligned}\log{\lambda} & = \beta_0 + \beta_1 M_i + (\beta_2 \times 1) + (\beta_3 \times M_i \times 1)\\\log{\lambda} & = \beta_0 + \beta_1 M_i + \beta_2 + \beta_3M_i\\\end{aligned}\)

Fit the Poisson model

Fit the model with the interaction term in R:

cuckoo_glm1 <- glm(Beg ~ Mass + Species + Mass:Species, data=cuckoo, family=poisson(link="log"))

summary(cuckoo_glm1)

Call:
glm(formula = Beg ~ Mass + Species + Mass:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.334475   0.227143   1.473  0.14088    
Mass                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler       0.674820   0.259217   2.603  0.00923 ** 
Mass:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Note there appears to be a negative interaction effect for Species:Mass, indicating that Begging calls do not increase with mass as much as you would expect for Warblers as compared to Cuckoos.

Exercise

Open exercises/exercise_04.R for prompts.

  • Fit a Poisson model to the cuckoo data

  • Look at the residual plots - how have they changed?

10:00

Poisson model check

Chi-squared tests

Offset - for rates

Poisson limitations

So for the Poisson regression case we assume that the mean and variance are the same. Hence, as the mean increases, the variance increases also (heteroscedascity). This may or may not be a sensible assumption so watch out! Just because a Poisson distribution usually fits well for count data, doesn’t mean that a Gaussian distribution can’t always work.

Recall the link function between the predictors and the mean and the rules of logarithms (if \(\log{\lambda} = k\)(value of predictors), then \(\lambda = e^k\)):

Overdispersion

\[ Y_i = Poisson(\phi \lambda_i) \]

Where:

  • \(Yi\) is the response variable.
  • \(\phi\) is the dispersion parameter, which adjusts the variance of the distribution

F distributions

Negative Binomial

Negative Binomial

Negative binomial GLMs are favored when overdispersion is high.

It has two parameters, \(μ\) and \(k\). \(k\) controls for the dispersion parameter (smaller \(k\) indicates higher dispersion). It corresponds to a combination of two distributions (Poisson and gamma).

It assumes that the \(Y_i\) are Poisson distributed with the mean \(μ_i\) assumed to follow a gamma distribution:

\[ E(Y_i) = μ_i \\ \text{Var}(Y_i) = μ_i + μ_i^2/k \]

Fitting a negative binomial in R

Zero-inflation

When the data comprise an excess number of zeros, that arise from a different process than the process that generates the counts.

RETURN TO CUCKOO DATA!

Other distributions

Gamma

Second column

Simulate

Survival

Sometimes relationships are easy to spot…

Sometimes relationships are easy to spot…

Other times not so much!

What are generalised additive models?

Let’s start with linear models…

\(\mathbb{E}(Y) = \alpha + x_1 + x_2 + ... + x_p\)


Then generalised linear models…

\(g(\mathbb{E}(Y)) = \alpha + x_1 + x_2 + ... + x_p\)

What are generalised additive models?

Generalised additive models are essentially just a sum of smooth functions of the explanatory variables.

\(g(\mathbb{E}(Y)) = \alpha + s(x_1) + s(x_2) + ... + s(x_p)\)

What are generalised additive models?

GAMs complexity diagram

Generalised additive models in R

There are several packages for fittings GAMs in R. The two main packages are:

  • {gam}

  • {mgcv}

We’ll be using {mgcv}.

Hint: don’t load both packages at the same time!

Live demo!



See examples/example_01.R for full code.

Exercise 1

Open exercises/exercise_01.R for prompts.

  • Load the packages required for this workshop.

  • Load the data for the exercises. Subset it to look at France only.

  • Split the data into training and testing sets.

  • Create a plot of confirmed cases over time.

  • Fit a linear model using the gam() function in {mgcv} with confirmed as the outcome, and date_obs as the explanatory variable.

10:00

See exercise_solutions/exercise_solutions_01.R for full solution.

Generalised additive models in R

Fitting smooth functions

We want to fit some sort of smooth function to each predictor. We can add together basis functions.

Different basis functions

Different basis functions with fitted line

Fitting smooth functions

gam_0 <- gam(confirmed ~ date_obs, data = gbr_train)
gam_1 <- gam(confirmed ~ s(date_obs), data = gbr_train)


There are different smooth classes available: s(), te(), ti(), and t2(). Smooth classes are invoked directly by s() terms, or as building blocks for tensor product smoothing via te(), ti() or t2() terms.


There are also different types of splines, specified by the bs argument. See ?smooth.terms for details.

Multiple terms

fit <- gam(y ~ s(x1) + s(x2) + x3, data = data)
  • You can include multiple predictors in gam().

  • You can use different types of smoothing for each one.

  • You can include non-smooth terms (esp. categorical) variables as well.

How smooth is smooth enough?

GAMs showing two levels of smoothing

Additional arguments

  • Level of smoothing can be controlled with sp in gam() or s().

  • You can choose the dimension of the basis function using k in s().

Hint: look at ?gam and ?s to see the wide range of arguments.

Live demo!



See examples/example_02.R for full code.

Exercise 2

Open exercises/exercise_02.R for prompts.

  • Fit a GAM using s() and gam() to the confirmed cases in France.

  • Try adding different additional terms.

  • Try varying the smoothing parameter sp for each variable.

  • Plot the smooth functions for the predictors, and the outcome.

10:00

See exercise_solutions/exercise_solutions_02.R for full solution.

Predicting with GAMs

After you’ve fitted a GAM…

  • Is it a reasonable model?

  • Is it better than other models?

  • How well does it fit to the data?

  • What are the forecasted values?

  • How do we interpret the model?

Live demo!



See examples/example_03.R for full code.

Exercise 3

Open exercises/exercise_03.R for prompts.

  • Run gam.check() on your GAM from Exercise 2. Do you need to refit?

  • Obtain the fitted values.

  • Make a forecast for the range of the test data.

  • Inspect the contribution of different terms.

10:00

See exercise_solutions/exercise_solutions_03.R for full solution.

When GAMs don’t quite work…

Common problems

  • Unstable estimates: when a smooth term can be approximated by some combination of the others.

  • Computationally expensive for large data sets: see bam().

  • Lack of independence between observations: see gamm().

  • Unstable predictions outside the range of training data: see {mvgam}

Are observations independent?

acf(gbr_data$confirmed, main = "ACF plot of confirmed cases")

Generalized additive mixed effect models (GAMMs)

  • Like GAMs, GAMMs allow for non-linear relationships between predictors and the response variable by fitting smooth functions to each predictor.

  • GAMMs also allow for the inclusion of random effects, which capture the variability of observations within groups or clusters.

  • This includes adding correlation structures.

Live demo!



See examples/example_04.R for full code.

Exercise 4

Open exercises/exercise_04.R for prompts.

  • Look at the ACF and PACF plots of the residuals from your previous GAM.

  • Try fitting a GAMM model instead.

10:00

See exercise_solutions/exercise_solutions_04.R for full solution.

Additional resources

  • GLMs resource list: